NAME = "Abhinav Asheesh Lugun"
ID = "122322"
From lecture, we know that members of the exponential family of distributions can be written in the form $$p(y;\eta) = b(y)e^{(\eta^\top T(y)-a(\eta))},$$ where $\eta$ is the natural parameter, $T(y)$ is the sufficient statistic, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure.
Each choice of $T$, $a$, and $b$ defines a family (set) of distributions parameterized by $\eta$.
If we can write $p(y \mid \mathbf{x} ; \theta)$ as a member of the exponential family of distributions with parameters $\mathbf{\eta}$ with $\eta_i = \theta^\top_i \mathbf{x}$, we obtain a generalized linear model that can be optimized using the maximum likelihood principle.
The GLM for the Gaussian distribution with natural parameter $\eta$ being the mean of the Gaussian gives us ordinary linear regression.
The Bernoulli distribution with parameter $\phi$ can be written as a member of the exponential family with natural parameter $\eta = \log \frac{\phi}{1-\phi}$. The GLM for this distribution is logistic regression.
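To see this concretely, write the Bernoulli pmf in the exponential-family form above: $$\begin{eqnarray} p(y;\phi) & = & \phi^y (1-\phi)^{1-y} \\ & = & e^{y \log \phi + (1-y)\log(1-\phi)} \\ & = & e^{y \log\frac{\phi}{1-\phi} + \log(1-\phi)}, \end{eqnarray}$$ so $b(y) = 1$, $T(y) = y$, $\eta = \log\frac{\phi}{1-\phi}$, and $a(\eta) = -\log(1-\phi) = \log(1 + e^\eta)$. Inverting $\eta$ gives $\phi = \frac{1}{1+e^{-\eta}}$, the logistic (sigmoid) function.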
When we write the multinomial distribution with parameters $\phi_i > 0$ for classes $i \in 1..K$ with the constraint that $$\sum_{i=1}^{K} \phi_i = 1$$ as a member of the exponential family, the resulting GLM is called multinomial logistic regression. The parameters $\phi_1, \ldots, \phi_K$ are written in terms of $\theta$ as $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$
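As a quick numerical sanity check, here is the softmax computed for a hypothetical vector of scores $\theta_i^\top \mathbf{x}$ with $K = 3$ (the score values are made up for illustration):

```python
import numpy as np

# Hypothetical scores theta_i^T x for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
phi = np.exp(scores) / np.exp(scores).sum()
print(phi)        # roughly [0.659, 0.242, 0.099]
print(phi.sum())  # the probabilities sum to 1
```

Note that larger scores get larger probabilities, and the output always forms a valid distribution.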
In multinomial regression, we have
Data are pairs $\mathbf{x}^{(i)}, y^{(i)}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in 1..K$.
The hypothesis is a vector-valued function $$\mathbf{h}_\theta(\mathbf{x}) = \begin{bmatrix} p(y = 1 \mid \mathbf{x} ; \theta) \\ p(y = 2 \mid \mathbf{x} ; \theta) \\ \vdots \\ p(y = K \mid \mathbf{x} ; \theta) \end{bmatrix},$$
where $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$
We need a cost function and a way to minimize that cost function. As usual, we try to find the parameters maximizing the likelihood or log likelihood function, or equivalently, minimizing the negative log likelihood function:
$$\theta^* = \text{argmax}_\theta {\cal L}(\theta) = \text{argmax}_\theta \ell(\theta) = \text{argmin}_\theta J(\theta),$$where $$\begin{eqnarray} J(\theta) & = & - \ell(\theta) \\ & = & - \sum_{i=1}^m \log p(y^{(i)} \mid \textbf{x}^{(i)} ; \theta). \end{eqnarray}$$
Now that we know what $J(\theta)$ is, let's try to find its minimum by taking derivatives with respect to an arbitrary parameter $\theta_{kl}$, the $l$-th element of the parameter vector $\theta_k$ for class $k$. Before we start, let's define a variable $a_k$ as the linear activation for class $k$ in the softmax function: $$ a_k = \theta_k^\top \mathbf{x}^{(i)}, $$ and rewrite the softmax more conveniently as $$ \phi_k = \frac{e^{a_k}}{\sum_{j=1}^K e^{a_j}}. $$ That makes it a little easier to compute the gradient: $$\begin{eqnarray} \frac{\partial J}{\partial \theta_{kl}} & = & - \sum_{i=1}^m \frac{1}{\phi_{y^{(i)}}} \frac{\partial \phi_{y^{(i)}}}{\partial \theta_{kl}}. \\ \end{eqnarray}$$ Using the chain rule, we have $$\frac{\partial \phi_{y^{(i)}}}{\partial \theta_{kl}} = \sum_{j=1}^K \frac{\partial \phi_{y^{(i)}}}{\partial a_j} \frac{\partial a_j}{\partial \theta_{kl}}.$$ The second factor is easy: $$ \frac{\partial a_j}{\partial \theta_{kl}} = \delta(k=j)x^{(i)}_l. $$ For the first factor, the quotient rule gives $$\begin{eqnarray} \frac{\partial \phi_{y^{(i)}}}{\partial a_j} & = & \frac{ \left[ \delta(y^{(i)}=j)e^{a_{y^{(i)}}} \sum_{c=1}^K e^{a_c} \right] - e^{a_{y^{(i)}}} e^{a_j} }{\left[ \sum_{c=1}^K e^{a_c} \right]^2} \\ & = & \phi_{y^{(i)}} \left( \delta(y^{(i)}=j) - \phi_j \right). \end{eqnarray}$$
Substituting what we've derived into the definition above, the factor $\phi_{y^{(i)}}$ cancels against $\frac{1}{\phi_{y^{(i)}}}$, and we obtain $$ \frac{\partial J}{\partial \theta_{kl}} = - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \frac{\partial a_j}{\partial \theta_{kl}}. $$
There are two ways to do the calculation. In deep neural networks with multinomial outputs, we want to first calculate the $\frac{\partial J}{\partial a_j}$ terms then use them to calculate $\frac{\partial J}{\partial \theta_{kl}}$.
However, if we only have the "single layer" model described up till now, we note that $$\frac{\partial a_j}{\partial \theta_{kl}} = \delta(j=k) x^{(i)}_l,$$ so we can simplify as follows: $$\begin{eqnarray} \frac{\partial J}{\partial \theta_{kl}} & = & - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \frac{\partial a_j}{\partial \theta_{kl}} \\ & = & - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \delta(j=k) x^{(i)}_l \\ & = & - \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l. \\ \end{eqnarray}$$
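As a sanity check on this derivation, we can compare the analytic gradient $-\sum_i (\delta(y^{(i)}{=}k) - \phi_k)\, x^{(i)}_l$ against a finite-difference approximation of $J$. The snippet below is a quick sketch on small random data, not part of the assignment scaffold:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, m = 3, 4, 5
theta = rng.normal(size=(K, n))   # one parameter vector per class
X = rng.normal(size=(m, n))       # m random input patterns
y = rng.integers(0, K, size=m)    # m random class labels

def softmax(a):
    a = a - a.max()               # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

def J(theta):
    # negative log likelihood: -sum_i log phi_{y^(i)}
    return -sum(np.log(softmax(theta @ X[i])[y[i]]) for i in range(m))

# Analytic gradient: dJ/dtheta_k = -sum_i (delta(y_i = k) - phi_k) x^(i)
grad = np.zeros((K, n))
for i in range(m):
    phi = softmax(theta @ X[i])
    for k in range(K):
        grad[k] -= ((y[i] == k) - phi[k]) * X[i]

# Finite-difference check on one arbitrary entry theta_{1,2}
eps = 1e-6
t2 = theta.copy()
t2[1, 2] += eps
num = (J(t2) - J(theta)) / eps
assert abs(num - grad[1, 2]) < 1e-4
```

If the derivation were off by a sign or a factor, the assertion would fail.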
OK! Now we have all four components of our multinomial regression model:
Data are pairs $\mathbf{x}^{(i)}, y^{(i)}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in 1..K$.
The hypothesis is a vector-valued function $$\mathbf{h}_\theta(\mathbf{x}) = \begin{bmatrix} p(y = 1 \mid \mathbf{x} ; \theta) \\ p(y = 2 \mid \mathbf{x} ; \theta) \\ \vdots \\ p(y = K \mid \mathbf{x} ; \theta) \end{bmatrix},$$
where $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$
The cost function is $$J(\theta) = - \sum_{i=1}^m \log p(y^{(i)} \mid \textbf{x}^{(i)})$$
The optimization algorithm is gradient descent on $J(\theta)$. Since $\frac{\partial J}{\partial \theta_{kl}} = - \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l$, the update rule is $$\theta_{kl}^{(n+1)} \leftarrow \theta_{kl}^{(n)} + \alpha \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l.$$ (The code below also averages the gradient over the $m$ examples.)
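One gradient-descent step can be vectorized over all examples and classes at once. The sketch below uses small hypothetical random data; note that because $\frac{\partial J}{\partial \theta_k} = -\sum_i (\delta(y^{(i)}{=}k) - \phi_k)\mathbf{x}^{(i)}$, the descent step *adds* $\alpha \sum_i (\delta - \phi_k)\mathbf{x}^{(i)}$:

```python
import numpy as np

# One vectorized gradient-descent step on hypothetical data (m=4, n=3, K=3)
rng = np.random.default_rng(1)
m, n, K = 4, 3, 3
X = rng.normal(size=(m, n))
y = rng.integers(0, K, size=m)
theta = np.zeros((K, n))
alpha = 0.1

scores = X @ theta.T                          # (m, K) activations a_k
scores -= scores.max(axis=1, keepdims=True)   # stability shift
phi = np.exp(scores)
phi /= phi.sum(axis=1, keepdims=True)         # (m, K) softmax rows
onehot = np.eye(K)[y]                         # delta(y_i = k) as a (m, K) matrix
theta += alpha * (onehot - phi).T @ X         # descent step, all k and l at once
```

The loop-based implementation later in the notebook computes the same quantity one class at a time.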
The following example of multinomial logistic regression is from Kaggle.
The data set is the famous Iris dataset from the UCI machine learning repository.
The data contain 50 samples from each of three classes. Each class refers to a particular species of the iris plant. The data include four independent variables: sepal length, sepal width, petal length, and petal width (all in cm).
The target takes on one of three classes: Iris-setosa, Iris-versicolor, and Iris-virginica.
To predict the target value, we use multinomial logistic regression for $K=3$ classes, i.e., $y \in \{ 1, 2, 3 \}$.
Given $\mathbf{x}$, we would like to predict a probability distribution over the three outcomes for $y$, i.e., $\phi_1 = p(y=1 \mid \mathbf{x})$, $\phi_2 = p(y=2 \mid \mathbf{x})$, and $\phi_3 = p(y=3 \mid \mathbf{x})$.
# importing libraries
import numpy as np
import pandas as pd
import random
import math
The phi function returns $\phi_i$ for input patterns $\mathtt{X}$ and parameters $\theta$.
def phi(i, theta, X, num_class):
    """
    The `phi` function returns phi_i = h_theta(x) for input patterns X and parameters theta.

    Here is how to make documentation for your function show up in IntelliSense:
    the explanation you put here will be shown when you use the function.
    To get IntelliSense in your Jupyter notebook:
    - Press 'TAB' after typing a dot (.) to see methods and attributes
    - Press 'Shift+TAB' after typing a function name to see its documentation

    Inputs:
        i = index of phi
        X = input dataset
        theta = parameters
    Returns:
        phi_i
    """
    mat_theta = np.matrix(theta[i])
    mat_x = np.matrix(X)
    num = math.exp(np.dot(mat_theta, mat_x.T))
    den = 0
    for j in range(0, num_class):
        mat_theta_j = np.matrix(theta[j])
        den = den + math.exp(np.dot(mat_theta_j, mat_x.T))
    phi_i = num / den
    return phi_i
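If `math.exp` overflows for large activations (as can happen later with unnormalized features), one possible vectorized variant of `phi` subtracts the maximum activation before exponentiating; this leaves the probabilities unchanged. This is a sketch, not part of the assignment scaffold:

```python
import numpy as np

def phi_all(theta, x):
    """Return all K class probabilities at once for one input vector x.

    Subtracting the max activation before exponentiating gives the same
    probabilities but avoids overflow when theta_j . x is large.
    """
    a = np.asarray(theta) @ np.asarray(x)  # (K,) activations
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

# e.g. activations as large as 1000 no longer overflow:
p = phi_all(np.array([[1000.0], [999.0]]), np.array([1.0]))
```

Mathematically, $\frac{e^{a_i - c}}{\sum_j e^{a_j - c}} = \frac{e^{a_i}}{\sum_j e^{a_j}}$ for any constant $c$, so shifting by the maximum is exact.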
Tip for using IntelliSense: press Shift+TAB after a function name to see its documentation.
The grad_cost function gives the gradient of the cost for data $\mathtt{X}, \mathbf{y}$ for class $j\in 1..k$.
def indicator(i, j):
    '''
    Check whether i is equal to j.
    Returns:
        1 when i == j, otherwise 0
    '''
    return 1 if i == j else 0
def grad_cost(X, y, j, theta, num_class):
    '''
    Compute the gradient of the cost function for data X, y with respect to
    the parameters for class j in 1..k.
    '''
    m, n = X.shape
    total = np.zeros(n)
    for i in range(0, m):
        p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
        total = total + (X.loc[i] * p)
    grad = -total / m
    return grad
def gradient_descent(X, y, theta, alpha, iters, num_class):
    '''
    Perform iters iterations of gradient descent:
    theta_new = theta_old - alpha * grad
    '''
    n = X.shape[1]
    for iter in range(iters):
        dtheta = np.zeros((num_class, n))
        for j in range(0, num_class):
            dtheta[j, :] = grad_cost(X, y, j, theta, num_class)
        theta = theta - alpha * dtheta
    return theta
def h(X, theta, num_class):
    '''
    Hypothesis function: returns the softmax probability vector with entries
    h_i = exp(theta_i . x) / sum_j exp(theta_j . x).
    '''
    X = np.matrix(X)
    h_matrix = np.empty((num_class, 1))
    den = 0
    for j in range(0, num_class):
        den = den + math.exp(np.dot(theta[j], X.T))
    for i in range(0, num_class):
        h_matrix[i] = math.exp(np.dot(theta[i], X.T))
    h_matrix = h_matrix / den
    return h_matrix
Create a function to load data from Iris.csv using the Pandas library and extract y from the data.
You can use the Pandas 10 minute guide to learn how to use pandas.
def load_data(file_name, drop_label, y_label, is_print=False):
    # 1. Load csv file
    data = pd.read_csv(file_name)
    if is_print:
        print(data.head())
    # 2. Remove the drop_label column (e.g. 'Id') from data
    if drop_label is not None:
        data = data.drop([drop_label], axis=1)
        if is_print:
            print(data.head())
    # 3. Extract y_label column as y from data
    y = None
    # 4. Get index of the y column
    y_index = data.columns.get_loc(y_label)
    # 5. Extract X features from data
    X = None
    # YOUR CODE HERE
    X = data.iloc[:, 0:y_index]
    y = data.iloc[:, y_index]
    return X, y
X, y = load_data('Iris.csv', 'Id', 'Species', True)
print(X.head())
print(y[:5])
# Test function: Do not remove
# Tip: this is how to create a dataset using pandas
d_ex = {'ID': [1, 2, 3, 4, 5, 6, 7],
        'Grade': [3.5, 2.5, 3.0, 3.75, 2.83, 3.95, 2.68],
        'Type': ['A', 'B', 'C', 'A', 'C', 'A', 'B']}
df = pd.DataFrame(d_ex, columns=['ID', 'Grade', 'Type'])
df.to_csv('out.csv', index=False)
Xtest, ytest = load_data('out.csv', 'ID', 'Type')
assert len(Xtest.columns) == 1, 'number of X_columns incorrect (1)'
assert ytest.name == 'Type', 'Extract y_column is incorrect (1)'
assert ytest.shape == (7,), 'number of y is incorrect (1)'
assert 'Grade' in Xtest.columns, 'Incorrect columns in X (1)'
Xtest, ytest = load_data('out.csv', None, 'Type')
assert len(Xtest.columns) == 2, 'number of X_columns incorrect (2)'
assert ytest.name == 'Type', 'Extract y_column is incorrect (2)'
assert ytest.shape == (7,), 'number of y is incorrect (2)'
assert 'Grade' in Xtest.columns and 'ID' in Xtest.columns, 'Incorrect columns in X (2)'
import os
os.remove('out.csv')
assert len(X.columns) == 4, 'number of X_columns incorrect (3)'
assert 'SepalWidthCm' in X.columns and 'Id' not in X.columns and 'Species' not in X.columns, 'Incorrect columns in X (3)'
assert y.name == 'Species', 'Extract y_column is incorrect (3)'
assert y.shape == (150,), 'number of y is incorrect (3)'
print("success!")
# End Test function
Expected result:

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
    0            5.1           3.5            1.4           0.2
    1            4.9           3.0            1.4           0.2
    2            4.7           3.2            1.3           0.2
    3            4.6           3.1            1.5           0.2
    4            5.0           3.6            1.4           0.2
    0    Iris-setosa
    1    Iris-setosa
    2    Iris-setosa
    3    Iris-setosa
    4    Iris-setosa
    Name: Species, dtype: object
Partition data into training and test sets
def encode(y, y_labels_name):
    for i in range(len(y)):
        if y[i] == y_labels_name[0]:
            y[i] = 0
        elif y[i] == y_labels_name[1]:
            y[i] = 1
        elif y[i] == y_labels_name[2]:
            y[i] = 2
    return y
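As an aside, pandas can do this label-to-integer mapping in one call with `pd.factorize`, which assigns codes in order of first appearance. A small sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels; pd.factorize maps each distinct label to an integer code
y_demo = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'])
codes, uniques = pd.factorize(y_demo)
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

This returns both the encoded array and the label order, which plays the role of `y_labels_name` above.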
def partition(X, y, percent_train):
    # 1. Create index list
    # 2. Shuffle index
    # 3. Create train/test index
    # 4. Separate X_train, y_train, X_test, y_test
    # 5. Get y_labels_name from y using the pandas.unique function
    # 6. Map the names in y_labels_name onto integer codes in y_labels_new
    # 7. Drop the shuffled index columns
    #    - pandas.reset_index() and pandas.drop(...) might help
    y_labels_name = None
    y_labels_new = None
    # YOUR CODE HERE
    idx = np.arange(0, y.shape[0])
    random.shuffle(idx)
    partition_index = int(len(idx) * percent_train)
    train_idx = idx[:partition_index]
    test_idx = idx[partition_index:]
    X_train = X.iloc[train_idx].reset_index(drop=True)
    y_train = y.iloc[train_idx].reset_index(drop=True).copy()
    X_test = X.iloc[test_idx].reset_index(drop=True)
    y_test = y.iloc[test_idx].reset_index(drop=True).copy()
    y_labels_name = pd.unique(y)
    y_labels_new = np.array([0, 1, 2])
    y_train = encode(y_train, y_labels_name)
    y_test = encode(y_test, y_labels_name)
    return idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new
percent_train = 0.7
idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new = partition(X, y, percent_train)
print('X_train.shape', X_train.shape)
print('X_test.shape', X_test.shape)
print('y_train.shape', y_train.shape)
print('y_test.shape', y_test.shape)
print('y_labels_name: ', y_labels_name)
print('y_labels_new: ', y_labels_new)
print(X_train.head())
print(y_train.head())
# Test function: Do not remove
assert len(y_labels_name) == 3 and len(y_labels_new) == 3, 'number of y uniques are incorrect'
assert X_train.shape == (105, 4), 'Size of X_train is incorrect'
assert X_test.shape == (45, 4), 'Size of x_test is incorrect'
assert y_train.shape == (105, ), 'Size of y_train is incorrect'
assert y_test.shape == (45, ), 'Size of y_test is incorrect'
assert 'Iris-setosa' in y_labels_name and 'Iris-virginica' in y_labels_name and \
'Iris-versicolor' in y_labels_name, 'y unique data incorrect'
assert min(y_labels_new) == 0 and max(y_labels_new) < 3, 'label indices are incorrect'
print("success!")
# End Test function
Expected result: (*or similar*)

    X_train.shape (105, 4)
    X_test.shape (45, 4)
    y_train.shape (105,)
    y_test.shape (45,)
    y_labels_name:  ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
    y_labels_new:  [0 1 2]
       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
    0            6.4           2.8            5.6           2.2
    1            6.7           3.3            5.7           2.1
    2            4.6           3.4            1.4           0.3
    3            5.1           3.8            1.5           0.3
    4            5.0           2.3            3.3           1.0
    Name: Species
    0    2
    1    2
    2    0
    3    0
    4    1
Train your classification model using the gradient_descent function already provided.
You might also play around with the gradient descent function to see if you can speed it up!
# num_class is the number of unique labels
num_class = len(y_labels_name)
# Add an intercept column of ones (only if it has not been added yet)
if (X_train.shape[1] == X.shape[1]):
    X_train.insert(0, "intercept", 1)
# Reset m and n for training data
r, c = X_train.shape
# Initialize theta for each class
theta_initial = np.ones((num_class, c))
alpha = 0.05
iterations = 200
theta = None
# Multinomial logistic regression
# YOUR CODE HERE
theta = gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
print(theta)
print(theta.shape)
# Test function: Do not remove
assert theta.shape == (3, 5), 'Size of theta is incorrect'
print("success!")
# End Test function
Expected result: (*or similar*)

    [[ 1.17632192  1.32360047  1.83204165 -0.20224445  0.44039155]
     [ 1.10140069  1.13537321  0.74833178  1.21907866  0.82567377]
     [ 0.72227738  0.54102632  0.41962657  1.98316579  1.73393467]]
    (3, 5)
Let's get your model to make predictions on the test data.
# Prediction on test data
if (X_test.shape[1] == X.shape[1]):
    X_test.insert(0, "intercept", 1)
# Reset m and n for test data
r, c = X_test.shape
y_pred = []
for index, row in X_test.iterrows():  # get a row of X_test data
    # calculate y_hat using the hypothesis function
    y_hat = None
    # find the index (integer value) of the maximum value in y_hat: that is the prediction
    prediction = None
    # YOUR CODE HERE
    y_hat = h(row, theta, num_class)
    prediction = int(np.argmax(y_hat))
    # collect the result
    y_pred.append(prediction)
print(len(y_pred))
print(y_pred[:7])
print(type(y_pred[0]))
# Test function: Do not remove
assert len(y_pred) == 45, 'Size of y_pred is incorrect'
assert isinstance(y_pred[0], int) and isinstance(y_pred[15], int) and isinstance(y_pred[17], int), 'prediction type is incorrect'
assert max(y_pred) < 3 and min(y_pred) >= 0, 'wrong index of y_pred'
print("success!")
# End Test function
Expected result: (*or similar*)

    45
    [2, 0, 2, 0, 0, 0, 2]
    <class 'int'>
Estimate accuracy of model on test data
$$\text{accuracy} = \frac{\text{number of correct test predictions}}{m_{\text{test}}}$$

def calc_accuracy(y_test, y_pred):
    accuracy = None
    # YOUR CODE HERE
    num_correct = 0
    for is_correct in (y_test == y_pred):
        if is_correct:
            num_correct += 1
    accuracy = num_correct / len(y_test)
    return accuracy
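For reference, the same computation can be written as a one-liner with NumPy, since averaging a boolean comparison with `np.mean` gives exactly the fraction of correct predictions. A sketch with hypothetical labels:

```python
import numpy as np

# Equivalent one-liner: the mean of a boolean array is the fraction of Trues
y_true = np.array([0, 1, 2, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0])
acc = float(np.mean(y_true == y_hat))  # 4 of 5 correct -> 0.8
```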
accuracy = calc_accuracy(y_test, y_pred)
print('Accuracy: %.4f' % accuracy)
# Test function: Do not remove
assert isinstance(accuracy, float), 'accuracy should be floating point'
assert accuracy >= 0.8, 'Did you train the data?'
print("success!")
# End Test function
Expected result: should be at least 0.8!
We will do the following in lab:
Define my_J() and implement it:

def my_J(theta, X, y, j, num_class):
    cost = None
    # YOUR CODE HERE
    cost = -indicator(y, j) * math.log(phi(j, theta, X, num_class))
    return cost
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))
cost = my_J(test_theta, X_train.loc[10], y_train[10], 0, 3)
assert isinstance(cost, float), 'cost should be floating point'
print("success!")
# End Test function
Define my_grad_cost using your my_J function:

def my_grad_cost(X, y, j, theta, num_class):
    grad = None
    cost = None
    # YOUR CODE HERE
    m, n = X.shape
    total = np.zeros(n)
    cost = 0.0
    for i in range(0, m):
        p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
        total = total + (X.loc[i] * p)
        cost += my_J(theta, X.loc[i], y[i], j, num_class)
    grad = -total / m
    return grad, cost
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))
grad, cost = my_grad_cost(X_train, y_train, 0, test_theta, num_class)
print(grad)
print(cost)
assert isinstance(cost, float), 'cost should be floating point'
assert isinstance(grad['intercept'], float) and \
isinstance(grad['SepalLengthCm'], float) and \
isinstance(grad['SepalWidthCm'], float) and \
isinstance(grad['PetalLengthCm'], float) and \
isinstance(grad['PetalWidthCm'], float) , 'grad should be floating point'
print("success!")
# End Test function
Expected result: (*or similar*)

    intercept        0.009524
    SepalLengthCm    0.316825
    SepalWidthCm    -0.091429
    PetalLengthCm    0.780000
    PetalWidthCm     0.329524
    dtype: float64
    37.352817814715735
Define my_gradient_descent using your my_grad_cost function:

def my_gradient_descent(X, y, theta, alpha, iters, num_class):
    cost_arr = []
    # YOUR CODE HERE
    n = X.shape[1]
    for iter in range(iters):
        dtheta = np.zeros((num_class, n))
        total_cost = 0.0
        for j in range(0, num_class):
            dtheta[j, :], cost = my_grad_cost(X, y, j, theta, num_class)
            total_cost += cost
        theta = theta - alpha * dtheta
        cost_arr.append(total_cost)
    return theta, cost_arr
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))
theta, cost = my_gradient_descent(X_train, y_train, theta_initial, 0.001, 5, 3)
print(theta)
print(cost)
print("success!")
# End Test function
Expected result: (*or similar*)

    [[1.00001186 0.99618853 1.00183642 0.9889817  0.99528923]
     [1.00009697 1.0011823  0.99883395 1.00316763 1.00083055]
     [0.99987915 1.00255606 0.99929351 1.00779768 1.00386218]]
    [114.00099216453735, 113.89036233839263, 113.78163144339288, 113.67472269747496, 113.56956268162737]
import matplotlib.pyplot as plt

theta_arr = []
cost_arr = []
accuracy_arr = []
# design your own learning rate and num iterations
alpha_arr = np.array([None, None, None, None])
iterations_arr = np.array([None, None, None, None])
# YOUR CODE HERE
alpha_arr = np.array([0.001, 0.02, 0.07, 0.1])
iterations_arr = np.array([200, 350, 400, 450])
for iterations in iterations_arr:
    for alpha in alpha_arr:
        theta_initial = np.ones((num_class, c))
        theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
        cost_arr.append(cost)
        theta_arr.append(theta)
        r, c = X_test.shape
        y_pred = []
        for index, row in X_test.iterrows():
            y_hat = h(row, theta, num_class)
            prediction = int(np.argmax(y_hat))
            y_pred.append(prediction)
        accuracy = calc_accuracy(y_test, y_pred)
        accuracy_arr.append(accuracy)

i = 0
for iterations in iterations_arr:
    plt.grid(axis='both')
    for alpha in alpha_arr:
        x_axis = np.arange(0, iterations)
        cost = cost_arr[i]
        label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100, 2)) + "%"
        plt.plot(x_axis, cost, label=label)
        i += 1
    title = "For " + str(iterations) + " Iterations"
    plt.title(title)
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.legend()
    plt.show()
    plt.clf()
From the plots, the learning rate of 0.1 with 450 iterations produced the best results: the cost decreased more quickly with this configuration than with any other combination. However, at one point during training the cost briefly increased and was worse than that of the model trained with the lower learning rate of 0.07. Although the cost soon started decreasing again, the descent was not smooth, and only after a certain number of iterations did it fall below the cost of the other models, decreasing smoothly from then on. We can conclude that increasing the learning rate beyond this point makes training less stable and can hurt performance over some range of iterations.

Note that the cost of the best-performing model was still decreasing slightly at the end of training, so it had not yet converged; training for more iterations might improve it a little further. One useful technique for this problem is early stopping: halt training once the model stops improving, which also helps avoid overfitting. With early stopping, we can set a generous iteration budget and largely eliminate the task of choosing the optimal number of iterations by hand.
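The early-stopping idea can be sketched as a simple rule on the recorded cost array: stop once the cost has not improved by more than a tolerance for a few consecutive iterations. A minimal sketch (the cost values, tolerance, and patience here are hypothetical, not taken from the runs above):

```python
def early_stop_index(costs, tol=1e-4, patience=5):
    """Return the iteration index at which training should stop.

    Stops when the best cost seen so far has not improved by more than
    `tol` for `patience` consecutive iterations.
    """
    best = float('inf')
    since_improve = 0
    for t, c in enumerate(costs):
        if c < best - tol:
            best = c            # real improvement: reset the counter
            since_improve = 0
        else:
            since_improve += 1
            if since_improve >= patience:
                return t
    return len(costs) - 1       # never triggered: use the full budget

# Hypothetical cost curve that plateaus at 6.89 after iteration 4
costs = [10.0, 8.0, 7.0, 6.9, 6.89, 6.89, 6.89, 6.89, 6.89, 6.89]
stop = early_stop_index(costs, patience=3)  # stops at index 7
```

In practice the rule is usually applied to a held-out validation cost rather than the training cost, so it also guards against overfitting.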
Expected result: (*Yours doesn't have to be the same!*)
We see that the Iris dataset is pretty easy. Depending on the train/test split, we get 95-100% accuracy.
Find a more interesting multi-class classification problem on Kaggle (cite your source), clean the dataset to obtain numerical input features without missing values, split the data into test and train sets, and experiment with multinomial logistic regression.
Write a brief report on your experiments and results. As always, turn in a Jupyter notebook by email to the instructor and TA.
from IPython.display import Image
Image(filename='data-original.png')
The dataset I used for the take-home exercise is 'Palmer Archipelago (Antarctica) penguin data' from Kaggle. It is a great dataset for practicing data exploration and visualization, and is often seen as an alternative to the Iris dataset. For my experiment, I tried to build a model that predicts a penguin's species from the information provided. The species in this dataset are Chinstrap, Gentoo, and Adelie.

Note: 'Palmer Archipelago (Antarctica) penguin data' provides two datasets, one a simplified version of the other. I used the simplified version, since it contains all the important information and omits unnecessary details.
def load_data(file_name, drop_label, y_label, is_print=False):
    data = pd.read_csv(file_name)
    if is_print:
        print(data.head())
    if drop_label is not None:
        data = data.drop([drop_label], axis=1)
        if is_print:
            print(data.head())
    y_index = data.columns.get_loc(y_label)
    X = data.iloc[:, y_index+1:]
    y = data.iloc[:, y_index]
    return X, y
# Loading datset
X, y = load_data('penguins_size.csv', None, 'species', True)
# Check for missing values in the training and test data
print("Current Missing Data")
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
print('Missing values for y dataset \n ------------------------\n', y.isnull().sum())
print(X['sex'].value_counts())
gender = X['sex'].value_counts()
print('Elements in sex variable', gender.shape)
print('Male ratio ', gender[0]/sum(gender.values))
X['sex'].fillna('MALE', inplace=True, limit=5)
X['sex'].fillna('FEMALE', inplace=True, limit=5)
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
print(X['culmen_length_mm'].value_counts())
print('mean culmen_length_mm ', np.mean(X['culmen_length_mm']))
culmenLength_mm_mean = np.mean(X['culmen_length_mm'])
X['culmen_length_mm'].fillna(culmenLength_mm_mean, inplace=True, limit=2)
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
print("culmen_depth_mm")
print(X['culmen_depth_mm'].value_counts())
print("------------------------")
print("flipper_length_mm")
print(X['flipper_length_mm'].value_counts())
print("------------------------")
print("body_mass_g")
print(X['body_mass_g'].value_counts())
print('mean culmen_depth_mm ', np.mean(X['culmen_depth_mm']))
print('mean flipper_length_mm ', np.mean(X['flipper_length_mm']))
print('mean body_mass_g ', np.mean(X['body_mass_g']))
culmenDepth_mean, flipperLength_mean, bodyMass_mean = np.mean(X['culmen_depth_mm']), np.mean(X['flipper_length_mm']), np.mean(X['body_mass_g'])
X['culmen_depth_mm'].fillna(culmenDepth_mean, inplace=True, limit=2)
X['flipper_length_mm'].fillna(flipperLength_mean, inplace=True, limit=2)
X['body_mass_g'].fillna(bodyMass_mean, inplace=True, limit=2)
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
X.select_dtypes(['object']).columns
print(X['sex'].value_counts())
X['sex'] = X['sex'].astype("category").cat.codes -1
X.select_dtypes(['object']).columns
print(X['island'].value_counts())
X['island'] = X['island'].astype("category").cat.codes
X.select_dtypes(['object']).columns
X.head()
Data for columns 'culmen_length_mm','culmen_depth_mm','flipper_length_mm', and 'body_mass_g' were normalized since taking exponential of the dot product of training data and theta results in 'inf' due to their large values.
means = np.mean(X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']], axis=0)
stds = np.std(X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']], axis=0)
X_norm = X.copy()
X_norm[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']] = (X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']] - means) / stds
X_norm.head()
percent_train = 0.7
idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new = partition(X_norm, y, percent_train)
print('X_train.shape', X_train.shape)
print('X_test.shape', X_test.shape)
print('y_train.shape', y_train.shape)
print('y_test.shape', y_test.shape)
print('y_labels_name: ', y_labels_name)
print('y_labels_new: ', y_labels_new)
print(X_train.head())
print(y_train.head())
num_class = len(y_labels_name)
if (X_train.shape[1] == X.shape[1]):
    X_train.insert(0, "intercept", 1)
if (X_test.shape[1] == X.shape[1]):
    X_test.insert(0, "intercept", 1)
r,c = X_test.shape
theta_arr = []
cost_arr = []
accuracy_arr = []
alpha_arr = np.array([0.001, 0.02, 0.07, 0.1])
iterations = 450
for alpha in alpha_arr:
    theta_initial = np.ones((num_class, c))
    theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
    cost_arr.append(cost)
    theta_arr.append(theta)
    r, c = X_test.shape
    y_pred = []
    for index, row in X_test.iterrows():
        y_hat = h(row, theta, num_class)
        prediction = int(np.argmax(y_hat))
        y_pred.append(prediction)
    accuracy = calc_accuracy(y_test, y_pred)
    accuracy_arr.append(accuracy)
i = 0
plt.grid(axis='both')
for alpha in alpha_arr:
    x_axis = np.arange(0, iterations)
    cost = cost_arr[i]
    label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100, 2)) + "%"
    plt.plot(x_axis, cost, label=label)
    i += 1
title = "Iterations vs. Cost Plot"
plt.title(title)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()
plt.clf()
1) Data from 'Palmer Archipelago (Antarctica) penguin data' was uploaded
2) Missing Values in the data were filled in
3) Data for columns 'culmen_length_mm','culmen_depth_mm','flipper_length_mm', and 'body_mass_g' were normalized
4) Categorical columns were converted into numeric entries
5) Data was split into train and test datasets
6) Model was trained with 4 different learning rates
From the iteration vs. cost plot, we observe that increasing the learning rate lets the model reduce the cost more quickly over the first iterations and hence approach convergence sooner. However, as training progressed, the models trained with lower learning rates gradually caught up with those trained with higher learning rates. One way to exploit this observation and train more efficiently is to begin with a relatively large learning rate and then decrease it gradually as training progresses, which increases the probability of converging to a minimum of the cost function. Early stopping, discussed briefly in exercise 2.2, can also be added to further improve training, so that a larger iteration budget can be used without tuning it by hand.